OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development

Internal identifier: 001553 (Main/Exploration); previous: 001552; next: 001554

Authors: Paul Baker; Andrew Hardie; Tony McEnery; Richard Xiao; Kalina Bontcheva; Hamish Cunningham; Robert Gaizauskas; Oana Hamza; Diana Maynard; Valentin Tablan; Cristian Ursu; B. D. Jayaram; Mark Leisher [United States]

Source: Literary and Linguistic Computing [0268-1145]; 2004

RBID : ISTEX:FFB348324CCE43FC1008E9CDA475307B9BB54003

Abstract

This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&D necessary for the exploitation of the EMILLE corpora.

URL: https://api.istex.fr/document/FFB348324CCE43FC1008E9CDA475307B9BB54003/fulltext/pdf
DOI: 10.1093/llc/19.4.509



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development</title>
<author>
<name sortKey="Baker, Paul" sort="Baker, Paul" uniqKey="Baker P" first="Paul" last="Baker">Paul Baker</name>
</author>
<author>
<name sortKey="Hardie, Andrew" sort="Hardie, Andrew" uniqKey="Hardie A" first="Andrew" last="Hardie">Andrew Hardie</name>
</author>
<author>
<name sortKey="Mcenery, Tony" sort="Mcenery, Tony" uniqKey="Mcenery T" first="Tony" last="Mcenery">Tony Mcenery</name>
</author>
<author>
<name sortKey="Xiao, Richard" sort="Xiao, Richard" uniqKey="Xiao R" first="Richard" last="Xiao">Richard Xiao</name>
</author>
<author>
<name sortKey="Bontcheva, Kalina" sort="Bontcheva, Kalina" uniqKey="Bontcheva K" first="Kalina" last="Bontcheva">Kalina Bontcheva</name>
</author>
<author>
<name sortKey="Cunningham, Hamish" sort="Cunningham, Hamish" uniqKey="Cunningham H" first="Hamish" last="Cunningham">Hamish Cunningham</name>
</author>
<author>
<name sortKey="Gaizauskas, Robert" sort="Gaizauskas, Robert" uniqKey="Gaizauskas R" first="Robert" last="Gaizauskas">Robert Gaizauskas</name>
</author>
<author>
<name sortKey="Hamza, Oana" sort="Hamza, Oana" uniqKey="Hamza O" first="Oana" last="Hamza">Oana Hamza</name>
</author>
<author>
<name sortKey="Maynard, Diana" sort="Maynard, Diana" uniqKey="Maynard D" first="Diana" last="Maynard">Diana Maynard</name>
</author>
<author>
<name sortKey="Tablan, Valentin" sort="Tablan, Valentin" uniqKey="Tablan V" first="Valentin" last="Tablan">Valentin Tablan</name>
</author>
<author>
<name sortKey="Ursu, Cristian" sort="Ursu, Cristian" uniqKey="Ursu C" first="Cristian" last="Ursu">Cristian Ursu</name>
</author>
<author>
<name sortKey="Jayaram, B D" sort="Jayaram, B D" uniqKey="Jayaram B" first="B. D." last="Jayaram">B. D. Jayaram</name>
</author>
<author>
<name sortKey="Leisher, Mark" sort="Leisher, Mark" uniqKey="Leisher M" first="Mark" last="Leisher">Mark Leisher</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:FFB348324CCE43FC1008E9CDA475307B9BB54003</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1093/llc/19.4.509</idno>
<idno type="url">https://api.istex.fr/document/FFB348324CCE43FC1008E9CDA475307B9BB54003/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001721</idno>
<idno type="wicri:Area/Istex/Curation">001626</idno>
<idno type="wicri:Area/Istex/Checkpoint">000D88</idno>
<idno type="wicri:doubleKey">0268-1145:2004:Baker P:corpus:linguistics:and</idno>
<idno type="wicri:Area/Main/Merge">001604</idno>
<idno type="wicri:Area/Main/Curation">001553</idno>
<idno type="wicri:Area/Main/Exploration">001553</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development</title>
<author>
<name sortKey="Baker, Paul" sort="Baker, Paul" uniqKey="Baker P" first="Paul" last="Baker">Paul Baker</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Hardie, Andrew" sort="Hardie, Andrew" uniqKey="Hardie A" first="Andrew" last="Hardie">Andrew Hardie</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Mcenery, Tony" sort="Mcenery, Tony" uniqKey="Mcenery T" first="Tony" last="Mcenery">Tony Mcenery</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Xiao, Richard" sort="Xiao, Richard" uniqKey="Xiao R" first="Richard" last="Xiao">Richard Xiao</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Bontcheva, Kalina" sort="Bontcheva, Kalina" uniqKey="Bontcheva K" first="Kalina" last="Bontcheva">Kalina Bontcheva</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Cunningham, Hamish" sort="Cunningham, Hamish" uniqKey="Cunningham H" first="Hamish" last="Cunningham">Hamish Cunningham</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Gaizauskas, Robert" sort="Gaizauskas, Robert" uniqKey="Gaizauskas R" first="Robert" last="Gaizauskas">Robert Gaizauskas</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Hamza, Oana" sort="Hamza, Oana" uniqKey="Hamza O" first="Oana" last="Hamza">Oana Hamza</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Maynard, Diana" sort="Maynard, Diana" uniqKey="Maynard D" first="Diana" last="Maynard">Diana Maynard</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Tablan, Valentin" sort="Tablan, Valentin" uniqKey="Tablan V" first="Valentin" last="Tablan">Valentin Tablan</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Ursu, Cristian" sort="Ursu, Cristian" uniqKey="Ursu C" first="Cristian" last="Ursu">Cristian Ursu</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Jayaram, B D" sort="Jayaram, B D" uniqKey="Jayaram B" first="B. D." last="Jayaram">B. D. Jayaram</name>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Leisher, Mark" sort="Leisher, Mark" uniqKey="Leisher M" first="Mark" last="Leisher">Mark Leisher</name>
<affiliation wicri:level="1">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>New Mexico State University Computing Labs</wicri:regionArea>
</affiliation>
<affiliation>
<wicri:noCountry code="subField"></wicri:noCountry>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Literary and Linguistic Computing</title>
<title level="j" type="abbrev">Lit Linguist Computing</title>
<idno type="ISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2004-11">2004-11</date>
<biblScope unit="volume">19</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="509">509</biblScope>
<biblScope unit="page" to="524">524</biblScope>
</imprint>
<idno type="ISSN">0268-1145</idno>
</series>
<idno type="istex">FFB348324CCE43FC1008E9CDA475307B9BB54003</idno>
<idno type="DOI">10.1093/llc/19.4.509</idno>
<idno type="local">190509</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&amp;D necessary for the exploitation of the EMILLE corpora.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
</list>
<tree>
<noCountry>
<name sortKey="Baker, Paul" sort="Baker, Paul" uniqKey="Baker P" first="Paul" last="Baker">Paul Baker</name>
<name sortKey="Bontcheva, Kalina" sort="Bontcheva, Kalina" uniqKey="Bontcheva K" first="Kalina" last="Bontcheva">Kalina Bontcheva</name>
<name sortKey="Cunningham, Hamish" sort="Cunningham, Hamish" uniqKey="Cunningham H" first="Hamish" last="Cunningham">Hamish Cunningham</name>
<name sortKey="Gaizauskas, Robert" sort="Gaizauskas, Robert" uniqKey="Gaizauskas R" first="Robert" last="Gaizauskas">Robert Gaizauskas</name>
<name sortKey="Hamza, Oana" sort="Hamza, Oana" uniqKey="Hamza O" first="Oana" last="Hamza">Oana Hamza</name>
<name sortKey="Hardie, Andrew" sort="Hardie, Andrew" uniqKey="Hardie A" first="Andrew" last="Hardie">Andrew Hardie</name>
<name sortKey="Jayaram, B D" sort="Jayaram, B D" uniqKey="Jayaram B" first="B. D." last="Jayaram">B. D. Jayaram</name>
<name sortKey="Maynard, Diana" sort="Maynard, Diana" uniqKey="Maynard D" first="Diana" last="Maynard">Diana Maynard</name>
<name sortKey="Mcenery, Tony" sort="Mcenery, Tony" uniqKey="Mcenery T" first="Tony" last="Mcenery">Tony Mcenery</name>
<name sortKey="Tablan, Valentin" sort="Tablan, Valentin" uniqKey="Tablan V" first="Valentin" last="Tablan">Valentin Tablan</name>
<name sortKey="Ursu, Cristian" sort="Ursu, Cristian" uniqKey="Ursu C" first="Cristian" last="Ursu">Cristian Ursu</name>
<name sortKey="Xiao, Richard" sort="Xiao, Richard" uniqKey="Xiao R" first="Richard" last="Xiao">Richard Xiao</name>
</noCountry>
<country name="États-Unis">
<noRegion>
<name sortKey="Leisher, Mark" sort="Leisher, Mark" uniqKey="Leisher M" first="Mark" last="Leisher">Mark Leisher</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
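
As an illustration of how the record above might be read programmatically, here is a minimal Python sketch. It assumes that a fragment of the <titleStmt> element has been copied into a string (only the first two authors are included here for brevity), and the helper name list_authors is hypothetical. The full record uses the wicri: prefix without a visible namespace declaration, so a namespace-aware parser would likely reject it as shown; the sketch therefore sticks to a fragment that contains no such prefix.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the <titleStmt> of the record (first two authors only).
FRAGMENT = """
<titleStmt>
<title xml:lang="en">Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development</title>
<author>
<name sortKey="Baker, Paul" sort="Baker, Paul" uniqKey="Baker P" first="Paul" last="Baker">Paul Baker</name>
</author>
<author>
<name sortKey="Hardie, Andrew" sort="Hardie, Andrew" uniqKey="Hardie A" first="Andrew" last="Hardie">Andrew Hardie</name>
</author>
</titleStmt>
""".strip()


def list_authors(xml_text):
    """Return (last, first) pairs for every <author>/<name> element.

    Hypothetical helper: it relies only on the 'first' and 'last'
    attributes that appear on <name> in the record.
    """
    root = ET.fromstring(xml_text)
    return [(n.get("last"), n.get("first")) for n in root.iterfind("author/name")]


authors = list_authors(FRAGMENT)
title = ET.fromstring(FRAGMENT).findtext("title")

print(title)
for last, first in authors:
    print(f"{last}, {first}")
```

The same approach should extend to the other <author> entries, since they carry the same attributes.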

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001553 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001553 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:FFB348324CCE43FC1008E9CDA475307B9BB54003
   |texte=   Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development
}}
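
If many such links have to be produced, the template call above could be generated rather than typed by hand. The following Python sketch is one possible way to do so; the function name explor_link and its parameter names are hypothetical, and the column alignment simply mimics the layout shown above rather than any documented requirement of the template.

```python
def explor_link(wiki, area, flux, etape, type_, cle, texte):
    """Build an {{Explor lien}} template call as a string.

    Hypothetical helper: the field names are taken from the template
    call shown on this page; values start at column 13 to match it.
    """
    fields = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", etape),
        ("type", type_),
        ("clé", cle),
        ("texte", texte),
    ]
    lines = ["{{Explor lien"]
    for key, value in fields:
        # Pad "   |key=" with spaces so that every value is aligned.
        lines.append(f"   |{key}=".ljust(13) + value)
    lines.append("}}")
    return "\n".join(lines)


link = explor_link(
    wiki="Ticri/CIDE",
    area="OcrV1",
    flux="Main",
    etape="Exploration",
    type_="RBID",
    cle="ISTEX:FFB348324CCE43FC1008E9CDA475307B9BB54003",
    texte="Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development",
)
print(link)
```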

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024